Hands-on curve fitting examples with Linear models, Neural Networks, Maximum Likelihood, Bayesian NN

Contents:

  1. Linear regression example and over-fitting mitigation via Ridge and Lasso
  2. Standard (nonlinear) neural network model
  3. Linear model with Neural Network frameworks
  4. Linear model with Tensorflow GradientTape
  5. Linear model with maximum likelihood under constant variance
  6. Linear model with maximum likelihood under non-constant variance
  7. Neural-Network Model using a Maximum Likelihood
  8. Conditional probability distribution modeling with Tensorflow Probability
  9. Bayesian neural network via variational inference/Flipout layer
  10. Bayesian neural network via Monte Carlo dropout/Dropout layer
  11. Revisiting maximum likelihood with non-constant variance via a Bayesian NN approach

This notebook is intended for interactive, hands-on learning to build familiarity with the art of neural network training. Using single-input, single-output curve-fitting examples with simulated data (a 100-point baseline and a 10-point small sample), we fit various models.

In my experience, setting up and training a neural network can be easy or difficult, depending on the context. When it is easy, the model trains well under many different layer-weight structures and training parameters. When it is hard, it can take time to find the right layer specification and the right training parameters. Sometimes it is useful to train a model over multiple sets of training parameters (i.e., train with one set, change the training parameters, and train some more). Because neural network weights are updated iteratively, one can save the weights at the best performance metric and continue training to see whether they can be improved.

Tensorflow tutorials are a great resource. Their examples, nonetheless, tend to focus on image-processing tasks such as handwritten-digit recognition and fashion-item classification. For those who mostly build predictive models with numerical or categorical inputs, it is useful to see how to train neural networks in the context of simple curve-fitting examples.

Even in the simple context of a curve-fitting problem, we show there are various ways one could approach it. Knowing several ways to tackle a problem can be handy in a difficult modeling situation. Maximum likelihood is often reliable when we have an explicit formula (i.e., a distribution function). Tensorflow GradientTape lets one customize the neural network training process. Tensorflow Probability (TFP) allows for conditional variance modeling via an arbitrary neural network. TFP also offers easy-to-use methods to implement Bayesian neural network specifications via Flipout and Dropout layers.

Remember to change model specs and experiment.

0. Data generation

Let's try a simple polynomial curve fitting example as in Chapter 1 of Bishop's book.

We generated data sets of medium (N=100) and small (N=10) sample sizes for our single-input curve fitting exercise. The underlying true data generating process is shown as a red curve. This serves as a case study for the general issue of over-fitting that can happen in machine learning. We begin our discussion with observing over-fitting in a linear model and applying regularization techniques (Lasso and Ridge) to mitigate it.
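The simulation itself lives in the notebook's code cells; as a hedged sketch (assuming, as in Bishop's Chapter 1, a sinusoidal true curve with Gaussian noise — the exact curve and noise level here are illustrative, not necessarily the notebook's), the data generation might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise_sd=0.3):
    # Single-input curve: y = sin(2*pi*x) plus Gaussian noise.
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_sd, size=n)
    return x, y

x100, y100 = make_data(100)  # medium sample
x10, y10 = make_data(10)     # small sample
```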

1. Linear Regression and Regularization Examples

1.1 Linear Regression Fit

A priori we generally do not know the right choice for the degree of polynomial fit in a given problem. We may try several versions of polynomials, visually inspect the fit, and/or calculate some statistics on the goodness of fit.

In our example, we can completely visualize how the curve fits the data, since there is only a single input; this makes it a helpful setting for building basic intuition about under-fitting and over-fitting.
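To make that intuition concrete, here is a small self-contained sketch (pure NumPy, not the notebook's own code) comparing training error across polynomial degrees. Training error always falls as the degree grows, which is exactly why it cannot diagnose over-fitting on its own:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 100)

# Training MSE for polynomial fits of increasing degree.
errors = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x, y, degree)
    errors[degree] = np.mean((np.polyval(coefs, x) - y) ** 2)

print(errors)  # training MSE shrinks as the degree grows
```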

1.2 Regularized model: Ridge, Lasso

Ridge and Lasso impose an additional cost in the minimization problem based on how far parameters deviate from zero.

Ridge and Lasso differ in their distance measure (the deviation from zero) in that Ridge uses the squared distance and Lasso uses the absolute distance.
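A minimal sketch of the Ridge idea in closed form (pure NumPy; the notebook itself may use a library implementation, and Lasso has no closed form, so only Ridge is shown here):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 10)        # small sample invites over-fitting
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 10)
X = np.vander(x, 10)                 # degree-9 polynomial features

def ridge(X, y, lam):
    # Closed-form ridge solution: w = (X'X + lam*I)^{-1} X'y.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge(X, y, 0.0)     # un-regularized least squares
w_reg = ridge(X, y, 1e-3)    # the squared-distance penalty shrinks coefficients
print(np.linalg.norm(w_reg) < np.linalg.norm(w_ols))  # True
```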

2. Neural Network (Standard/Nonlinear Fit)

We first fit a nonlinear neural network, as this is the standard usage. That is, the neural network will figure out how to fit a layered weight structure mapping the single input $[x]$ to the target output $y$.

With small sample

Not bad!? So one could estimate a neural network model with only 10 data points... I wasn't sure that would work.

Try changing the structure, optimizer, etc., to see how easy it is to train and obtain reasonable results.

3. Linear Model using Neural Network

This time we manually feed expanded inputs $[x, x^2, x^3]$ and fit a linear model with neural networks. (Just to show that we could do that.)

We estimate a 3rd degree polynomial given by $$y=b + a_1*x + a_2 * x^2 + a_3 * x^3 + \varepsilon$$ by following the example above.
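A hedged sketch of this setup in Keras (coefficients and training settings here are illustrative, not the notebook's): a single linear Dense unit on the expanded inputs $[x, x^2, x^3]$ is exactly the cubic polynomial above.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 100).astype("float32")
y = (1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3
     + rng.normal(0.0, 0.1, 100)).astype("float32")

X = np.stack([x, x**2, x**3], axis=1)   # manually expanded inputs

# One Dense unit with a linear activation = the cubic's coefficients.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(1, activation="linear"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss="mse")
model.fit(X, y, epochs=200, verbose=0)

weights, bias = model.get_weights()     # ~ [a1, a2, a3] and b
```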

4. Linear Model using tf.GradientTape updates

One could also estimate model parameters by iteratively applying gradient updates.

Why tf.GradientTape (https://www.tensorflow.org/guide/autodiff )?

tf.GradientTape is one of the most potent tools a machine learning engineer can have in their arsenal — its style of programming combines the beauty of mathematics with the power and simplicity of TensorFlow and Keras. https://medium.com/analytics-vidhya/tf-gradienttape-explained-for-keras-users-cc3f06276f22

The most useful application of Gradient Tape is when you design a custom layer in your keras model for example--or equivalently designing a custom training loop for your model. [...] GradientTape is a mathematical tool for automatic differentiation (autodiff), which is the core functionality of TensorFlow. It is a key part of performing the autodiff. https://stackoverflow.com/questions/53953099/what-is-the-purpose-of-the-tensorflow-gradient-tape

Thus, it may be useful to know the tf.GradientTape method, which gives more flexibility in specifying how the model weights are updated.
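As a minimal, self-contained sketch of the idea (not the notebook's own code), here is an explicit GradientTape loop fitting $y = a x + b$:

```python
import tensorflow as tf

x = tf.constant([0.0, 1.0, 2.0, 3.0])
y = tf.constant([1.0, 3.0, 5.0, 7.0])        # generated with a=2, b=1
a = tf.Variable(0.0)
b = tf.Variable(0.0)
opt = tf.keras.optimizers.SGD(learning_rate=0.05)

for _ in range(500):
    with tf.GradientTape() as tape:          # record ops on the tape
        loss = tf.reduce_mean((a * x + b - y) ** 2)
    grads = tape.gradient(loss, [a, b])      # autodiff through the tape
    opt.apply_gradients(zip(grads, [a, b]))

print(float(a), float(b))                    # ≈ 2.0 and 1.0
```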

Let's start with an example with a simple shape.

That worked well. Now let's try for our simulated data.

We can make this faster by compiling the update step with a tf.function decorator.
https://www.tensorflow.org/api_docs/python/tf/function

Note that GradientTape can be used with Keras models as well, as we show later.

5. Linear Model fitting using Maximum Likelihood (constant variance)

This time, we will use a formula (normal density) to fit the data.

Let $f(y,\mu,\sigma)$ be the density of a normal distribution at $y$, given the parameters for mean $\mu$ and standard deviation $\sigma$:

$$ f(y, \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(y - \mu)^2}{2 \sigma^2}}. $$

Keras makes maximum likelihood (ML) modeling easy: simply convert a target density function (e.g., the formula above) into a negative log likelihood and pass it as the loss function to the Keras model.

When we take the negative log of the above formula ($-\ln f(y, \mu, \sigma)$) and minimize it, the optimal $\mu$ is independent of $\sigma$. Thus, we don't need to estimate $\sigma$ and can give it some arbitrary, fixed value.
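A quick numerical check of this point (pure NumPy, with illustrative numbers): the $\mu$-dependent part of the NLL is just the sum of squared errors scaled by $1/(2\sigma^2)$, so any fixed $\sigma$ yields the same minimizer — the sample mean.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.0, 50)
sigma = 1.7                      # arbitrary fixed value

def nll(mu, sigma, y):
    # Negative log likelihood of i.i.d. N(mu, sigma^2) observations.
    return np.sum(np.log(sigma * np.sqrt(2.0 * np.pi))
                  + (y - mu) ** 2 / (2.0 * sigma ** 2))

# Minimize over a grid of mu values: the sigma term is a constant shift.
grid = np.linspace(0.0, 4.0, 401)
best = grid[np.argmin([nll(m, sigma, y) for m in grid])]
print(best, y.mean())            # both ≈ the sample mean
```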

With small sample

It appears that this simple model works for the small-sample case.

6. Linear Model fitting using Maximum Likelihood (Non-constant variance)

The previous model shows how to estimate a linear model via maximum likelihood, assuming a constant variance.

When allowing for the variance to vary, we have a more general formula for the log likelihood (Equation 4.8 and notebook nb_ch04_04) :

$$\text{loss} = \mathrm{NLL} = \sum_i \left[ -\log\left(\frac{1}{\sqrt{2 \pi \sigma_{x_i}^2}}\right) + \frac{(y_i - \mu_{x_i})^2}{2\sigma_{x_i}^2} \right]$$

where the subscript $x_i$ indicates conditioning on the input data point. We can iteratively minimize this loss expression. The output is a conditional mean and a conditional variance given the input $x$.
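A direct NumPy transcription of the loss above (with illustrative values for $y_i$, $\mu_{x_i}$, and $\sigma_{x_i}$):

```python
import numpy as np

def nll(y, mu, sigma):
    # Gaussian NLL with per-point (input-dependent) mu_i and sigma_i.
    return np.sum(-np.log(1.0 / np.sqrt(2.0 * np.pi * sigma ** 2))
                  + (y - mu) ** 2 / (2.0 * sigma ** 2))

y = np.array([0.5, 1.2, -0.3])
mu = np.array([0.4, 1.0, 0.0])
sigma = np.array([0.2, 0.5, 1.0])   # the variance now varies with the input
print(nll(y, mu, sigma))
```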

A model with non-constant variance shows where the confidence interval is wide or narrow over the region of input space where data are available. This approach, however, may not provide correct information about the uncertainty where the data are sparse or non-existent, and that is where a Bayesian approach comes in.

With small sample

This may make sense, but the estimates vary with the optimization conditions.

7. Neural-Network Model fitting using Maximum Likelihood (Non-constant variance)

This time let's make everything non-linear, and let the neural network decide how to construct a nonlinear model.

This model learned the nonlinear conditional mean and variance from a single-dimensional input $x$.

With small sample

With a small sample size, this model may be difficult to train. The results are pretty sensitive to training conditions.

We will come back to this using a Bayesian approach (while I am not sure if it can help much in this case).

Does using GradientTape to design a custom training loop help?

7.2 Try GradientTape/Eager training

Looks like it worked okay...

With small sample

It worked!? To be honest, I am not sure why.

8. Conditional Probability Distribution Modeling with TFP

Previously, we looked at linear models that directly estimate the linear parameters (slopes [$a_1$, $a_2$, $a_3$] and intercept $b$) for inputs [$x$, $x^2$, $x^3$], using a single dense layer with a linear activation function and either the mean squared error loss or a custom negative log-likelihood (NLL) loss.

We also reviewed using Neural Networks to specify nonlinear models.

Another technique was to specify a non-constant, conditional variance via a custom NLL loss function.
This type of model estimates the conditional mean $\mu_x$ and standard deviation $\sigma_x$ given input $x$.

Next, we turn to estimating the conditional probability distribution of $y$ given $x$, or $p(y|x)$.

For example, we can estimate a model with a Gaussian distribution $p(y|x) = N(\mu_x,\sigma_x)$. This is conceptually equivalent to the previous maximum-likelihood model of a Gaussian with conditional variance. This time, however, we use a Tensorflow Probability distribution layer to specify an output that represents the conditional mean and variance given the input.

As a result, procedurally the model estimation will differ in the following ways:

# Note: the code below is mostly borrowed from notebook nb_ch08_03 if you want to review the source.

Why does it look so unstable?

It is because the prediction is generated by sampling from the Gaussian distribution specified in the final layer. All the weights in the network are fixed, but the random draws from this fixed-parameter Gaussian differ each time for a given input.

If we repeat the draws, we get different predictions and can calculate a summary distribution for each input data point.

To summarize the results, we take the average of, say, 200 draws and also calculate the 2.5th and 97.5th percentiles for a 95-percent confidence interval.
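The summarizing step itself is plain NumPy (the draws array here is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
# Stand-in for 200 sampled predictions at each of 5 input points.
draws = rng.normal(loc=[[0.0], [0.5], [1.0], [0.5], [0.0]],
                   scale=0.3, size=(5, 200))

mean = draws.mean(axis=1)                           # average prediction
lo, hi = np.percentile(draws, [2.5, 97.5], axis=1)  # 95% interval bounds
```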

With small sample

Not too bad...

8.2 Try GradientTape/Eager training

Let's try:

hmm...not necessarily better.

With Small Sample

9. Bayesian Neural Network via Variational Inference

Now we train a Bayesian neural network (BNN) via variational inference (VI). Instead of directly estimating the weights (i.e., the parameters in each layer), we approximate the posterior of the weights with normal distributions.

The normal distribution has two parameters; therefore, we have roughly twice as many parameters as the non-Bayesian version (not exactly double because we don't use a distribution for the bias terms).

We can convert the non-Bayesian model above into a BNN-VI model by replacing the Dense layers with DenseFlipout layers, which replace fixed weights with weight distributions whose approximate posteriors are fit by minimizing a KL-divergence-based variational loss.

Now we sample mean $\mu$ and standard deviation $\sigma$. As in the previous model, each time we get a different prediction. This time, however, we have another source of uncertainty, namely the random draws of model layer weights.

# Try changing the model structures and optimizers to get a sense of how to train this BNN. To be honest, I don't really understand how to effectively select these and train the model.

With small sample

hmm..

9.2 Training with GradientTape/Eager

That may look pretty good.

With Small Sample

hmm...

10. Bayesian Neural Network via MC Dropout

An alternative approach to implementing a BNN is to use a Dropout layer after each layer.

This approach randomly drops neurons with a certain probability at each layer, which in effect gives each layer's weights a two-point discrete distribution (zero with probability $p$, or some non-zero value with probability $1-p$).

This approach does not necessarily increase the number of parameters from the non-Bayesian version and is relatively easy to implement (one could choose to increase the number of neurons in the dense layers since some of them will be dropped).
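A minimal sketch of the MC-dropout mechanics (untrained model, illustrative sizes): passing `training=True` keeps the Dropout layer active at prediction time, so repeated calls give different draws.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dropout(0.2),   # stays active when training=True is passed
    tf.keras.layers.Dense(1),
])

x = np.linspace(-1.0, 1.0, 20).reshape(-1, 1).astype("float32")
# 200 stochastic forward passes = 200 draws from the implied weight distribution.
draws = np.stack([model(x, training=True).numpy() for _ in range(200)])

mean = draws.mean(axis=0)
lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)
```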

Try changing the model structures and optimizers to get a sense of how to train this BNN. To be honest, I don't really understand how to effectively select these and train the model.

In theory, one could specify the non-Bayesian, BNN-VI, and BNN-MC models in parallel structures as follows:

With small sample

It appears to show that there is large uncertainty (see the expanded range of vertical axis).

10.2 Training with GradientTape/Eager

It appears okay for the 100-point data.

With Small Sample

Not bad...

11. Revisiting the Maximum likelihood formula with non-constant variance using a Bayesian NN approach

We will convert the previous model of nonlinear maximum likelihood with non-constant variance into a BNN model. This is done by:

# Note: one could also convert it to a BNN-VI model using Flipout layers.

Note that there is no sign of over-fitting in this case.

That seems to work.

With small sample

This may show large uncertainty.